Definitions: 

The value of an entry in a row and a column is the data entered in that cell. 
The dictionary for a column has an entry for each different value in the column. 
The width of a column is the number of bits used to specify its entries. 
The cardinality of a column is the number of different values in its rows. 
Given: Table column with n rows and cells k bits wide 

Determine cardinality m of column, where 
m is the number of different entries in the column 

Create dictionary for column entries 
Dictionary has m rows and width k bits 
Dictionary line numbers have width w bits 
such that if p = 2^w then m < p 

Rewrite column using dictionary references 
Reset column width to w bits 

Check that w is minimum, 
where w = log2 p and p > m 




FIG. 1A 
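
The steps of FIG. 1A can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the name dictionary_encode is hypothetical, the dictionary is sorted although FIG. 1A does not require any particular order, and w is taken as the smallest width that satisfies the condition m < 2^w given in the figure.

```python
def dictionary_encode(column):
    """Dictionary-encode one column (hypothetical helper, not from the source).

    Returns (dictionary, references, w): the m distinct values, one
    dictionary line number per row, and the reference width w in bits.
    """
    # One dictionary entry per distinct value; sorting is an assumption.
    dictionary = sorted(set(column))
    m = len(dictionary)
    # Smallest w with m < 2**w, the strict inequality shown in FIG. 1A.
    w = m.bit_length()
    line_number = {value: i for i, value in enumerate(dictionary)}
    # Rewrite the column as dictionary line numbers, each w bits wide.
    references = [line_number[value] for value in column]
    return dictionary, references, w


column = ["red", "green", "red", "blue", "green", "red"]
dictionary, references, w = dictionary_encode(column)
# Decoding through the dictionary restores the original column.
restored = [dictionary[r] for r in references]
```

With three distinct values in the sample column, two bits per reference are enough, whatever the width k of the original values.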
Column A dictionary: total m rows for different values; width k bits for complete values. 
Column A: total n rows, m different values; width w bits; total p different possible values, p = 2^w. 
Decrement w by 1 in a loop to reduce width. If m > p/2, w is minimum; stop. 

Figure 1B. Minimizing individual column width 
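
The width-reduction loop of Figure 1B might look like the sketch below. The name minimal_width is hypothetical, and the starting width k is assumed to satisfy m < 2^k. The figure's stop test reads m > p/2; the sketch stops at m >= p/2 so that the narrower width never violates the m < p condition of FIG. 1A. That boundary choice is an assumption, not something the figure states.

```python
def minimal_width(m, k):
    """Shrink a reference width from k bits to the minimum (hypothetical helper).

    m is the column cardinality; k is a starting width with m < 2**k.
    """
    if m >= 2 ** k:
        raise ValueError("starting width k is too small for cardinality m")
    w = k
    # Decrement while the next narrower width still satisfies m < p.
    while w > 0 and m < 2 ** (w - 1):
        w -= 1
    return w


# Illustrative values only: five distinct values, starting from 32 bits.
width_for_five = minimal_width(5, 32)
width_for_four = minimal_width(4, 32)
```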
Given: Table columns C1 and C2, each with n rows 
C1 is a list of document IDs d1i, for i = 1, ..., n 
C2 is a list of document IDs d2i, for i = 1, ..., n 
The document ID of a column is the row number. 

Define dictionaries D1 and D2 for C1 and C2 
Dictionaries are ordered by value IDs 
(alphanumeric by doc contents). 
The value ID for a value in a dictionary is the row number of 
its entry in the dictionary. 

Define combined dictionary D12 listing 
pairs [ d1i, d2i ] of doc IDs from columns 
C1 and C2 for each i = 1, ..., n 
and ordered by respective value IDs 
(alphanumeric order for d1 then for d2) 

Create combined column C12 using 
dictionary D12 line references and 
ordered by doc ID pairs for i = 1, ..., n 
(C12 order is same as C1 and C2) 

Delete columns C1 and C2 
All their info is now in D1, D2, D12 
and new combined column C12 
FIG. 2A 



Memory footprint of combined column C12 
plus dictionaries is generally much less than 
memory footprint of columns C1 and C2 
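
The column-combining method of FIG. 2A can be illustrated with a short Python sketch. It is an interpretation under stated assumptions rather than the method as claimed: the name combine_columns is hypothetical, the pairs stored in D12 are taken to be value IDs into D1 and D2, and all values are assumed to be mutually comparable (no NULL entries), so that the dictionaries can be sorted.

```python
def combine_columns(c1, c2):
    """Combine two equal-length columns (hypothetical helper, not from the source).

    Returns (d1, d2, d12, c12): the two sorted dictionaries, the combined
    dictionary of distinct value-ID pairs, and the combined column of
    D12 line references in the original row order.
    """
    if len(c1) != len(c2):
        raise ValueError("columns must have the same number of rows")
    d1 = sorted(set(c1))
    d2 = sorted(set(c2))
    id1 = {value: i for i, value in enumerate(d1)}
    id2 = {value: i for i, value in enumerate(d2)}
    # One value-ID pair per row, in document (row) order.
    pairs = [(id1[a], id2[b]) for a, b in zip(c1, c2)]
    # D12 holds each distinct pair once, ordered by first ID, then second.
    d12 = sorted(set(pairs))
    line_number = {pair: i for i, pair in enumerate(d12)}
    c12 = [line_number[pair] for pair in pairs]
    return d1, d2, d12, c12


c1 = ["a", "b", "a", "c"]
c2 = ["x", "y", "x", "x"]
d1, d2, d12, c12 = combine_columns(c1, c2)
# C1 and C2 can be rebuilt from D1, D2, D12 and C12 alone.
rebuilt_c1 = [d1[d12[r][0]] for r in c12]
rebuilt_c2 = [d2[d12[r][1]] for r in c12]
```

Because both original columns can be rebuilt from D1, D2, D12 and C12, deleting C1 and C2 loses no information in this sketch.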
Column 1 documents and Column 2 documents, ordered by doc ID. Surviving cell entries: 1; 5 NULL. 
Column 12 dictionary, ordered by value ID: pairs of Col 1 doc and Col 2 doc. 
Column 12 documents, ordered by doc ID. 

Figure 2B. Combining columns to save space 
Memory required (bits) 

             Original 1   Original 2   Combined 12 (worst case)   Combined 12 (best case) 
Column       n * w1       n * w2       n * (w1 + w2)              n * wm 
Dictionary   m1 * k1      m2 * k2      m1 * m2 * (w1 + w2)        mm * (w1 + w2) 
Key 
301 

n Number of rows in original columns (and in original table) 

wj Width of column j in bits (minimized as in method 100, FIG. 1A) 

mj Cardinality of column j (i.e., number of different values in column j) 

kj Width of widest value in column j in bits (typically, kj > wj) 

mm Maximum of m1 and m2 (i.e., larger of the two values) 

wm Maximum of w1 and w2 (i.e., larger of the two values) 



FIG. 3 
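
The formulas of FIG. 3 can be evaluated with a small Python sketch. The name footprint_bits is hypothetical, the "column" and "dictionary" labels for the two rows of formulas are an assumption drawn from the key, and the sample numbers are illustrative only, not taken from the source.

```python
def footprint_bits(n, m1, k1, w1, m2, k2, w2):
    """Evaluate the FIG. 3 memory formulas in bits (hypothetical helper)."""
    wm = max(w1, w2)  # wider of the two column widths
    mm = max(m1, m2)  # larger of the two cardinalities
    return {
        "original_1": {"column": n * w1, "dictionary": m1 * k1},
        "original_2": {"column": n * w2, "dictionary": m2 * k2},
        "combined_worst": {
            "column": n * (w1 + w2),
            "dictionary": m1 * m2 * (w1 + w2),
        },
        "combined_best": {
            "column": n * wm,
            "dictionary": mm * (w1 + w2),
        },
    }


# Illustrative inputs: 1000 rows, cardinalities 4 and 8, 64-bit values.
table = footprint_bits(n=1000, m1=4, k1=64, w1=3, m2=8, k2=64, w2=4)
```

For these sample inputs the best-case combined column needs 4000 bits, against 7000 bits for the two original columns together; in the worst case the combined column is no narrower than the two originals.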



